Bayesian Reinforcement Learning via Deep, Sparse Sampling
We address the problem of Bayesian reinforcement learning using efficient
model-based online planning. We propose an optimism-free Bayes-adaptive
algorithm that induces deeper and sparser exploration, with a theoretical
bound on its performance relative to the Bayes-optimal policy and a lower
computational complexity. The main novelty is the use of a candidate policy
generator to generate long-term options in the planning tree (over beliefs),
which allows us to create much sparser and deeper trees. Experimental results
on different environments show that, in comparison to the state of the art,
our algorithm is both computationally more efficient and obtains significantly
higher reward in discrete environments.
Comment: Published in AISTATS 2020
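To make the planning loop concrete, here is a minimal Python sketch of sparse sampling over beliefs with candidate policies acting as long-term options. The toy Beta-Bernoulli model, the function names, and the uniform sampling width are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of sparse sampling over beliefs with long-term options;
# the toy Beta-Bernoulli model and all names are illustrative assumptions.
import random

def sparse_sample(belief, depth, width, policies, simulate, gamma=0.95):
    """Estimate each candidate policy's value from `belief` with a sparse,
    deep lookahead tree; return the best (value, policy) pair."""
    if depth == 0:
        return 0.0, None
    best_value, best_policy = float("-inf"), None
    for pi in policies:          # long-term options from the policy generator
        total = 0.0
        for _ in range(width):   # sparse branching: few samples per option
            next_belief, reward = simulate(belief, pi)
            future, _ = sparse_sample(next_belief, depth - 1, width,
                                      policies, simulate, gamma)
            total += reward + gamma * future
        value = total / width
        if value > best_value:
            best_value, best_policy = value, pi
    return best_value, best_policy

# Toy usage: Beta-Bernoulli beliefs over two arms; an "option" is just an arm.
def simulate(belief, arm):
    a, b = belief[arm]
    p = random.betavariate(a, b)             # sample a model from the posterior
    r = 1.0 if random.random() < p else 0.0  # sample a reward from that model
    updated = list(belief)
    updated[arm] = (a + r, b + 1 - r)        # Bayes update of the played arm
    return tuple(updated), r

value, option = sparse_sample(((1, 1), (1, 1)), depth=3, width=2,
                              policies=[0, 1], simulate=simulate)
```

The point the sketch illustrates is that branching on a handful of long-term options per node, rather than on every primitive action, keeps the tree sparse enough to search deeply.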
Marich: A Query-efficient Distributionally Equivalent Model Extraction Attack using Public Data
We study black-box model stealing attacks where the attacker can query a
machine learning model only through publicly available APIs. Specifically, our
aim is to design a black-box model extraction attack that uses a minimal number
of queries to create an informative and distributionally equivalent replica of
the target model. First, we define distributionally equivalent and
max-information model extraction attacks. Then, we reduce both attacks to
a variational optimisation problem. The attacker solves this problem to select
the most informative queries that simultaneously maximise the entropy and
reduce the mismatch between the target and the stolen models. This leads us to
an active sampling-based query selection algorithm, Marich. We evaluate Marich
on different text and image data sets, and different models, including BERT and
ResNet18. Marich extracts models that recover much of the true model's
accuracy while using only a limited number of samples from the publicly
available query datasets, which are different from the private training
datasets. Models extracted by Marich yield prediction distributions that are
closer to the target's distribution than those of existing active
sampling-based algorithms. The extracted models also enable accurate
membership inference attacks. Experimental results validate that Marich
is query-efficient, and also capable of performing task-accurate,
high-fidelity, and informative model extraction.
Comment: Presented in the Privacy-Preserving AI (PPAI) workshop at AAAI 2023
as a spotlight talk
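As a rough illustration of the active query-selection step, the sketch below scores public-pool candidates by the replica's prediction entropy plus its disagreement with the target's API outputs, then queries the top-k. The additive score, the L1 mismatch, and all names are simplifying assumptions; the paper derives its selection rule from a variational optimisation problem.

```python
# Hedged sketch of entropy-plus-mismatch query selection in the spirit of
# Marich; the scoring rule is an illustrative simplification.
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy of each row of a probability matrix."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def select_queries(pool_probs_target, pool_probs_stolen, k):
    """Pick indices of the k most informative public-pool queries."""
    # disagreement between the target API's and the replica's predictions
    mismatch = np.abs(pool_probs_target - pool_probs_stolen).sum(axis=-1)
    scores = entropy(pool_probs_stolen) + mismatch
    return np.argsort(-scores)[:k]

# Usage: rows are softmax outputs of the target API and the current replica
# on the same public candidates (shape [n_candidates, n_classes]).
target = np.array([[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]])
stolen = np.array([[0.8, 0.2], [0.4, 0.6], [0.6, 0.4]])
print(select_queries(target, stolen, k=2))
```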
Interactive and Concentrated Differential Privacy for Bandits
Bandits play a crucial role in interactive learning schemes and modern
recommender systems. However, these systems often rely on sensitive user data,
making privacy a critical concern. This paper investigates privacy in bandits
with a trusted centralized decision-maker through the lens of interactive
Differential Privacy (DP). While bandits under pure $\epsilon$-global DP have
been well-studied, we contribute to the understanding of bandits under
zero-Concentrated DP (zCDP). We provide minimax and problem-dependent lower bounds
on regret for finite-armed and linear bandits, which quantify the cost of
$\rho$-global zCDP in these settings. These lower bounds reveal two hardness
regimes based on the privacy budget and suggest that $\rho$-global zCDP
incurs less regret than pure $\epsilon$-global DP. We propose two $\rho$-global
zCDP bandit algorithms, AdaC-UCB and AdaC-GOPE, for finite-armed and linear
bandits respectively. Both algorithms use a common recipe of Gaussian mechanism
and adaptive episodes. We analyze the regret of these algorithms to show that
AdaC-UCB achieves the problem-dependent regret lower bound up to multiplicative
constants, while AdaC-GOPE achieves the minimax regret lower bound up to
poly-logarithmic factors. Finally, we provide experimental validation of our
theoretical results under different settings.
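The common recipe of Gaussian mechanism plus adaptive episodes can be sketched as follows. This is a hedged illustration assuming rewards in [0, 1]; the confidence bonus is a simplified stand-in for AdaC-UCB's exact width, and none of the names come from the paper's code.

```python
# Illustrative sketch of a zCDP bandit in the style described above: per-arm
# means are released through the Gaussian mechanism once per doubling episode.
import math, random

def gaussian_mechanism(value, sensitivity, rho):
    # rho-zCDP Gaussian mechanism: sigma = sensitivity / sqrt(2 * rho)
    return value + random.gauss(0.0, sensitivity / math.sqrt(2 * rho))

def adac_ucb_sketch(pull, n_arms, horizon, rho):
    counts = [0] * n_arms
    sums = [0.0] * n_arms
    noisy_mean = [math.inf] * n_arms
    t = 0
    while t < horizon:
        bonus = lambda a: (math.inf if counts[a] == 0
                           else math.sqrt(2 * math.log(horizon) / counts[a]))
        arm = max(range(n_arms), key=lambda a: noisy_mean[a] + bonus(a))
        # adaptive episode: double the chosen arm's count before re-releasing,
        # so each arm needs only O(log T) private releases
        for _ in range(max(1, counts[arm])):
            if t >= horizon:
                break
            sums[arm] += pull(arm)
            counts[arm] += 1
            t += 1
        # the empirical mean of n rewards in [0, 1] has sensitivity 1/n
        noisy_mean[arm] = gaussian_mechanism(sums[arm] / counts[arm],
                                             sensitivity=1.0 / counts[arm],
                                             rho=rho)
    return noisy_mean, counts

# Usage: two Bernoulli arms with means 0.7 and 0.4.
pull = lambda a: float(random.random() < (0.7, 0.4)[a])
means, counts = adac_ucb_sketch(pull, n_arms=2, horizon=2000, rho=0.1)
```

Releasing statistics only at episode boundaries, rather than every round, is what keeps the total privacy cost of the Gaussian mechanism small.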
SAAC: Safe Reinforcement Learning as an Adversarial Game of Actor-Critics
Although Reinforcement Learning (RL) is effective for sequential
decision-making problems under uncertainty, it still fails to thrive in
real-world systems where risk or safety is a binding constraint. In this paper,
we formulate the RL problem with safety constraints as a non-zero-sum game.
When deployed with maximum entropy RL, this formulation leads to a safe
adversarially guided soft actor-critic framework, called SAAC. In SAAC, the
adversary aims to break the safety constraint while the RL agent aims to
maximize the constrained value function given the adversary's policy. The
safety constraint on the agent's value function manifests only as a repulsion
term between the agent's and the adversary's policies. Unlike previous
approaches, SAAC can address different safety criteria such as safe
exploration, mean-variance risk sensitivity, and CVaR-like coherent risk
sensitivity. We illustrate the design of the adversary for these constraints.
Then, in each of these variations, we show the agent differentiates itself from
the adversary's unsafe actions in addition to learning to solve the task.
Finally, for challenging continuous control tasks, we demonstrate that SAAC
achieves faster convergence, better efficiency, and fewer failures to satisfy
the safety constraints than risk-averse distributional RL and risk-neutral soft
actor-critic algorithms.
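A minimal sketch of the repulsion idea follows, assuming diagonal-Gaussian policies with closed-form KL divergence; the loss shape, coefficients, and names are illustrative assumptions, not SAAC's exact objectives.

```python
# Conceptual sketch: the agent's soft actor-critic loss gains a penalty that
# pushes its policy away from the safety adversary's (the repulsion term).
import numpy as np

def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """KL( N(mu_p, diag(std_p^2)) || N(mu_q, diag(std_q^2)) )."""
    var_p, var_q = std_p ** 2, std_q ** 2
    return float((np.log(std_q / std_p)
                  + (var_p + (mu_p - mu_q) ** 2) / (2.0 * var_q)
                  - 0.5).sum())

def agent_actor_loss(q_value, log_prob, agent_pi, adversary_pi,
                     alpha=0.2, beta=1.0):
    """Maximum-entropy actor loss minus a repulsion bonus: the agent
    maximises Q-value and entropy while staying far from the adversary."""
    repulsion = kl_diag_gauss(*agent_pi, *adversary_pi)
    return alpha * log_prob - q_value - beta * repulsion  # agent minimises this

# Usage with 2-D actions: (mean, std) parameters for each policy.
agent = (np.zeros(2), np.ones(2))
adversary = (np.array([1.0, -1.0]), np.full(2, 0.5))
loss = agent_actor_loss(q_value=1.3, log_prob=-0.7,
                        agent_pi=agent, adversary_pi=adversary)
```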
BelMan: Bayesian Bandits on the Belief--Reward Manifold
We propose a generic, Bayesian, information geometric approach to the
exploration--exploitation trade-off in multi-armed bandit problems. Our
approach, BelMan, uniformly supports pure exploration,
exploration--exploitation, and two-phase bandit problems. Knowledge of the
bandit arms and their reward distributions is summarised by the barycentre of
the joint distributions of beliefs and rewards of the arms, the
\emph{pseudobelief-reward}, within the beliefs-rewards manifold. BelMan
alternates \emph{information projection} and \emph{reverse information
projection}, i.e., projection of the pseudobelief-reward onto beliefs-rewards
to choose the arm to play, and projection of the resulting beliefs-rewards onto
the pseudobelief-reward. It introduces a mechanism that infuses an exploitative
bias by means of a \emph{focal distribution}, i.e., a reward distribution that
gradually concentrates on higher rewards. Comparative performance evaluation
with state-of-the-art algorithms shows that BelMan is not only competitive but
can also outperform other approaches in specific setups, for instance involving
many arms and continuous rewards.
Comment: 36 pages, 14 figures, accepted in ECML PKDD 2019
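The alternation can be caricatured for Beta-Bernoulli arms as below. This is a loose sketch under strong simplifications: the pseudobelief-reward is reduced to a barycentre of posterior parameters, the focal distribution to a tilt of that barycentre towards reward one, and the projections to a KL minimisation over arms; none of this is the authors' exact geometry.

```python
# Heavily simplified rendition of the projection / reverse-projection loop
# for Beta-Bernoulli arms; illustrative only.
from math import lgamma
from scipy.special import digamma

def kl_beta(a1, b1, a2, b2):
    """KL( Beta(a1, b1) || Beta(a2, b2) )."""
    ln_B = lambda a, b: lgamma(a) + lgamma(b) - lgamma(a + b)
    return (ln_B(a2, b2) - ln_B(a1, b1)
            + (a1 - a2) * digamma(a1) + (b1 - b2) * digamma(b1)
            + (a2 - a1 + b2 - b1) * digamma(a1 + b1))

def belman_round(beliefs, focal_tilt=0.5):
    """One round: form the pseudobelief barycentre, tilt it towards higher
    rewards (the exploitative focal bias), and I-project to choose an arm;
    the caller's Bayes update of the played arm stands in for the reverse
    projection back onto the pseudobelief."""
    a_bar = sum(a for a, _ in beliefs) / len(beliefs)
    b_bar = sum(b for _, b in beliefs) / len(beliefs)
    pseudo = (a_bar + focal_tilt, max(b_bar - focal_tilt, 1e-3))
    return min(range(len(beliefs)),
               key=lambda i: kl_beta(*pseudo, *beliefs[i]))

# Usage: play the chosen arm, then Bayes-update its Beta posterior.
beliefs = [(2.0, 1.0), (1.0, 1.0), (1.0, 2.0)]
arm = belman_round(beliefs)
```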